List of Figures

1 (a) Standard C-SVM like penalty function penalizes $y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) < \rho_1$. In B-SVM,
$\rho_1$ replaces the constant 1 from C-SVM. (b) Novel B-SVM penalty function. This
function penalizes $y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) > \rho_2$. (c) Total penalty function for B-SVM. If
$y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) \in [\rho_1, \rho_2]$ then the total penalty is 0. Choosing $C_2 < C_1$ will impose
a milder penalty for values of $y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) > \rho_2$.

2 Classification obtained for example data using (a) C-SVM and (b) B-SVM. Red and Blue
points (.) correspond to class +1 and −1 respectively. Cyan and Orange x-marks (x) show
the C-SVM and B-SVM decision rules evaluated at various points. Class 1 membership is
indicated in Cyan and class −1 membership is indicated in Orange. The squares in (a)
correspond to support points for which $0 < \alpha_i < C$. The cyan squares in (b) correspond to
support points for which $0 < \theta_i < C_2$ and the green squares correspond to support points
for which $0 < \alpha_i < C_1$. The sparsity of solution is controlled by $\boldsymbol{\alpha}$ in the case of C-SVM
and $(\boldsymbol{\alpha} - \boldsymbol{\theta})$ in the case of B-SVM. (c) Shows $\alpha_i$ values for C-SVM. (d) Shows $(\alpha_i - \theta_i)$
values for B-SVM.

3 Decision rule $g(\mathbf{x})$ for C-SVM (a) and B-SVM (b). Note that in B-SVM the second penalty
term $C_2 \sum_{i=1}^{n} [y_i(\boldsymbol{\beta}^T \mathbf{h}(\mathbf{x}_i) + \beta_0) - \rho_2]_+$ results in most of the $g(\mathbf{x})$ values lying in the
interval $[\rho_1, \rho_2] = [1, 1.5]$. (c) Heat map of the decision rule $g(\mathbf{x})$ for C-SVM. (d) Heat map
of the decision rule $g(\mathbf{x})$ for B-SVM. In C-SVM the values of decision rule $g(\mathbf{x})$ are
unbalanced in Class 1. The central cluster located at (0, 0) in Class 1 gets much smaller
$g(\mathbf{x})$ values in C-SVM than the rest of Class 1. In B-SVM however, all clusters in Class 1
including the one centered at (0, 0) get similar $g(\mathbf{x})$ values. This is a result of the second
penalty term in the B-SVM objective function.

4 Fraction of points classified correctly by both C-SVM (blue curve) and B-SVM (red curve)
as a function of the decision rule threshold. The x-axis shows the decision rule threshold
as a percentage of the maximum absolute value of the decision function $g(\mathbf{x})$ over all
training points. The y-axis shows the overall classification accuracy or sensitivity of
C-SVM and B-SVM.



Abstract 

We describe a novel binary classification technique called Banded SVM (B-SVM). In the standard
C-SVM formulation of Cortes and Vapnik [1995], the decision rule is encouraged to lie in the interval
$[1, \infty)$. The new B-SVM objective function contains a penalty term that encourages the decision
rule to lie in a user specified range $[\rho_1, \rho_2]$. In addition to the standard set of support vectors (SVs)
near the class boundaries, B-SVM results in a second set of SVs in the interior of each class.



Notation 

• Scalars and functions will be denoted in a non-bold font (e.g., $\beta_0$, $C$, $g$). Vectors and vector
functions will be denoted in a bold font using lower case letters (e.g., $\mathbf{x}$, $\boldsymbol{\beta}$, $\mathbf{h}$). Matrices will
be denoted in bold font using upper case letters (e.g., $\mathbf{B}$, $\mathbf{H}$). The transpose of a matrix $\mathbf{A}$
will be denoted by $\mathbf{A}^T$ and its inverse will be denoted by $\mathbf{A}^{-1}$. $\mathbf{I}_p$ will denote the $p \times p$
identity matrix and $\mathbf{0}$ will denote a vector or matrix of all zeros whose size should be clear
from context.

• $|x|$ will denote the absolute value of $x$ and $I(x > a)$ is an indicator function that returns 1 if
$x > a$ and 0 otherwise.

• The $j$th component of vector $\mathbf{t}$ will be denoted by $t_j$. The element $(i, j)$ of matrix $\mathbf{G}$ will be
denoted by $\mathbf{G}(i, j)$ or $G_{ij}$. The 2-norm of a $p \times 1$ vector $\mathbf{x}$ will be denoted by
$\|\mathbf{x}\|_2 = +\sqrt{\sum_{i=1}^{p} x_i^2}$. The probability distribution of a random vector $\mathbf{x}$ will be denoted by
$P_{\mathbf{x}}(\mathbf{x})$. $\mathrm{E}[f(s, \eta)]$ denotes the expectation of $f(s, \eta)$ with respect to both random variables
$s$ and $\eta$.



1 Introduction 

We consider the standard binary classification problem. Suppose $y_i$ is the class membership label
(+1 for class +1 and −1 for class −1) associated with a feature vector $\mathbf{x}_i$. Given $n$ such $(\mathbf{x}_i, y_i)$
pairs, we would like to learn a linear decision rule $g(\mathbf{x})$ that can be used to accurately predict the
class label $y$ associated with feature vector $\mathbf{x}$.

In C-SVM [Vapnik and Lerner, 1963, Boser et al., 1992, Cortes and Vapnik, 1995], one can think
of the linear decision rule $g$ as a means of measuring membership in a particular class. Given a
feature vector $\mathbf{x}$, C-SVM encourages the function $g(\mathbf{x})$ to be positive if $\mathbf{x} \in$ class +1 and negative
if $\mathbf{x} \in$ class −1.

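We motivate the development of B-SVM in the following way. Suppose that vector $\mathbf{x}$ comes from
an arbitrary probability distribution $P_{\mathbf{x}}(\mathbf{x})$ with mean $\mathrm{E}[\mathbf{x}] = \boldsymbol{\mu}$ and finite covariance $\mathrm{Cov}[\mathbf{x}] = \boldsymbol{\Sigma}$.
Consider the linear decision rule $g(\mathbf{x}) = \boldsymbol{\beta}^T \mathbf{x} + \beta_0$. It is easy to see that $g(\mathbf{x})$ has mean
$\mathrm{E}[g(\mathbf{x})] = \boldsymbol{\beta}^T \boldsymbol{\mu} + \beta_0$ and covariance $\mathrm{Cov}[g(\mathbf{x})] = \boldsymbol{\beta}^T \boldsymbol{\Sigma} \boldsymbol{\beta}$. By Chebyshev's inequality, there exists
a high probability band around $\mathrm{E}[g(\mathbf{x})]$ where $g(\mathbf{x})$ is expected to lie when $\mathbf{x}$ comes from $P_{\mathbf{x}}(\mathbf{x})$.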
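One way to make this band explicit is to write out Chebyshev's inequality for $g(\mathbf{x})$. The constant $k > 1$ below is introduced here purely for illustration and does not appear elsewhere in the text:

```latex
P\left( \left| g(\mathbf{x}) - \mathrm{E}[g(\mathbf{x})] \right|
        \geq k \sqrt{\boldsymbol{\beta}^T \boldsymbol{\Sigma} \boldsymbol{\beta}} \right)
  \leq \frac{1}{k^2}
```

For instance, $k = 3$ places $g(\mathbf{x})$ within three standard deviations of its mean with probability at least $8/9$, whatever the distribution $P_{\mathbf{x}}(\mathbf{x})$, provided the covariance is finite.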

Hence, for every probability distribution of vectors $\mathbf{x}$ from class +1 and class −1 with finite
covariance, $g(\mathbf{x})$ is expected to lie in a certain high probability band. In B-SVM, we choose $g(\mathbf{x})$ to
encourage:

• $y\, g(\mathbf{x}) > 0$ (same condition as C-SVM)

• $y\, g(\mathbf{x}) \in$ certain high probability band (new B-SVM condition)

Both of the above conditions can be satisfied if we encourage:

$$ y\, g(\mathbf{x}) \in [\rho_1, \rho_2] \quad \text{with} \quad \rho_2 > \rho_1 > 0 \tag{1.1} $$



Since non-linear decision rules in C-SVM are simply linear decision rules operating in a high dimen- 
sional space via the kernel trick [Boser et al., 1992], the B-SVM band formation argument holds 
for non-linear decision rules as well. 



2 Problem setup 



As per standard SVM terminology, assume that we are given $n$ data-label pairs $(\mathbf{x}_i, y_i)$ where $\mathbf{x}_i$
are $m \times 1$ vectors and the data labels $y_i \in \{-1, 1\}$. First, we consider only the linear case and
afterwards transform to the general case via the kernel trick. Let the $m \times 1$ vector $\boldsymbol{\beta}$ and scalar $\beta_0$ be
parameters of a linear decision rule $g(\mathbf{x}) = \boldsymbol{\beta}^T \mathbf{x} + \beta_0$, with $g(\mathbf{x}) = 0$ separating class +1 and −1 such
that $g(\mathbf{x}) > 0$ if $\mathbf{x}$ belongs to class +1 and $g(\mathbf{x}) < 0$ otherwise.



2.1 C-SVM objective function 

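The C-SVM objective function [Cortes and Vapnik, 1995] to be minimized can be written as:

$$ f_{CSVM}(\boldsymbol{\beta}, \beta_0) = \frac{1}{2} \|\boldsymbol{\beta}\|_2^2 + C \sum_{i=1}^{n} [1 - y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0)]_+ \tag{2.1} $$

where $[t]_+$ is the positive part of $t$:

$$ [t]_+ = \begin{cases} t & \text{if } t \geq 0, \\ 0 & \text{if } t < 0, \end{cases} \tag{2.2} $$

and $C$ governs the regularity of the solution. The C-SVM objective function penalizes signed
decisions $y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0)$ whenever their value is below 1. This is the only penalty in C-SVM.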

2.2 B-SVM objective function 

We present below the novel B-SVM objective function that we wish to minimize:

$$ f_{BSVM}(\boldsymbol{\beta}, \beta_0) = \frac{1}{2} \|\boldsymbol{\beta}\|_2^2 + \underbrace{C_1 \sum_{i=1}^{n} [\rho_1 - y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0)]_+}_{\text{C-SVM like penalty}} + \underbrace{C_2 \sum_{i=1}^{n} [y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) - \rho_2]_+}_{\text{novel B-SVM penalty}} \tag{2.3} $$

where $\rho_2 > \rho_1 > 0$ are margin parameters specified by the user and $C_1$ and $C_2$ are regularization
constants. This objective function has two penalty terms:

• The first penalty term is similar to C-SVM. It penalizes signed decisions $y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0)$
whenever their values are below $\rho_1$ (as opposed to 1 in C-SVM).

• The second penalty term is novel. It penalizes signed decisions $y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0)$ when their
values are above $\rho_2$.

The net effect of these penalty terms is to encourage $y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0)$ to lie in the interval $[\rho_1, \rho_2]$.
Please see Figure 1 for a sketch of the two penalty terms in B-SVM.
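The two penalty terms can be sketched numerically as follows. This is a minimal illustration rather than the author's implementation: the function name `banded_penalty` is hypothetical, and the default parameter values are simply those used later in the toy example.

```python
import numpy as np

def banded_penalty(s, rho1=1.0, rho2=1.5, c1=10.0, c2=100.0):
    """Per-point B-SVM penalty for signed decisions s = y * (beta^T x + beta0)."""
    s = np.asarray(s, dtype=float)
    lower = c1 * np.maximum(rho1 - s, 0.0)  # C-SVM like term, active when s < rho1
    upper = c2 * np.maximum(s - rho2, 0.0)  # novel term, active when s > rho2
    return lower + upper

# Signed decisions inside [rho1, rho2] incur no penalty.
penalties = banded_penalty([0.5, 1.0, 1.25, 1.5, 2.0])
```

Setting `c2=0.0` and `rho1=1.0` reduces the function to the usual hinge loss scaled by `c1`.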



3 Solving the B-SVM problem 

We derive the B-SVM dual problem in order to maximize a lower bound on the B-SVM primal
objective function in equation 2.3. This dual problem will be simpler to solve compared to the
primal form 2.3. We proceed as follows:

• As shown in 3.2, the primal problem in 2.3 can be modified into a convex quadratic objective
function with linear inequality constraints using slack variables.

• Consequently, strong duality holds and the maximum value of the B-SVM dual objective
function is equal to the minimum value of the B-SVM primal objective function in 2.3.

For more details on convex duality, please see Nocedal and Wright [2006].

Figure 1: (a) Standard C-SVM like penalty function penalizes $y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) < \rho_1$. In B-SVM, $\rho_1$
replaces the constant 1 from C-SVM. (b) Novel B-SVM penalty function. This function penalizes
$y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) > \rho_2$. (c) Total penalty function for B-SVM. If $y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) \in [\rho_1, \rho_2]$ then the total
penalty is 0. Choosing $C_2 < C_1$ will impose a milder penalty for values of $y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) > \rho_2$.

3.1 The B-SVM dual problem 

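We introduce slack variables:

$$ \xi_i = [\rho_1 - y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0)]_+, \qquad \eta_i = [y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) - \rho_2]_+ \tag{3.1} $$

into the primal objective function in 2.3. The modified optimization problem can be written
as:

$$ \min_{\boldsymbol{\beta}, \beta_0, \boldsymbol{\xi}, \boldsymbol{\eta}} f_{BSVM}(\boldsymbol{\beta}, \beta_0, \boldsymbol{\xi}, \boldsymbol{\eta}) = \frac{1}{2} \|\boldsymbol{\beta}\|_2^2 + C_1 \sum_{i=1}^{n} \xi_i + C_2 \sum_{i=1}^{n} \eta_i \tag{3.2} $$

subject to, for $i = 1, \ldots, n$:

$$ \begin{aligned}
\xi_i &\geq 0 & &\text{Lagrange multiplier } \mu_i \\
\eta_i &\geq 0 & &\text{Lagrange multiplier } \psi_i \\
\xi_i &\geq \rho_1 - y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) & &\text{Lagrange multiplier } \alpha_i \\
\eta_i &\geq -\rho_2 + y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) & &\text{Lagrange multiplier } \theta_i
\end{aligned} $$

After introducing Lagrange multipliers for each inequality constraint as shown in 3.2, the
Lagrangian function for problem 3.2 can be written as:

$$ \begin{aligned}
L(\boldsymbol{\beta}, \beta_0, \boldsymbol{\xi}, \boldsymbol{\eta}, \boldsymbol{\alpha}, \boldsymbol{\theta}, \boldsymbol{\mu}, \boldsymbol{\psi}) = {} & \frac{1}{2} \|\boldsymbol{\beta}\|_2^2 + C_1 \sum_{i=1}^{n} \xi_i + C_2 \sum_{i=1}^{n} \eta_i - \sum_{i=1}^{n} \alpha_i \{\xi_i - \rho_1 + y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0)\} \\
& - \sum_{i=1}^{n} \theta_i \{\eta_i + \rho_2 - y_i(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0)\} - \sum_{i=1}^{n} \mu_i \xi_i - \sum_{i=1}^{n} \psi_i \eta_i
\end{aligned} \tag{3.3} $$

where

$$ \alpha_i, \theta_i, \mu_i, \psi_i \geq 0 \tag{3.4} $$

Next, we solve for primal variables $\boldsymbol{\beta}, \beta_0, \boldsymbol{\xi}, \boldsymbol{\eta}$ in terms of the dual variables $\boldsymbol{\alpha}, \boldsymbol{\theta}, \boldsymbol{\mu}, \boldsymbol{\psi}$ by minimizing
$L(\boldsymbol{\beta}, \beta_0, \boldsymbol{\xi}, \boldsymbol{\eta}, \boldsymbol{\alpha}, \boldsymbol{\theta}, \boldsymbol{\mu}, \boldsymbol{\psi})$ with respect to the primal variables. Since the Lagrangian in 3.3 is a convex
function of the primal variables, its global minimum can be obtained using the first order
Karush Kuhn Tucker (KKT) conditions given in 3.5 - 3.8: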

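$$ \frac{\partial L}{\partial \boldsymbol{\beta}} = \boldsymbol{\beta} - \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i + \sum_{i=1}^{n} \theta_i y_i \mathbf{x}_i = \mathbf{0} \tag{3.5} $$

$$ \frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^{n} \alpha_i y_i + \sum_{i=1}^{n} \theta_i y_i = 0 \tag{3.6} $$

$$ \frac{\partial L}{\partial \xi_i} = C_1 - \alpha_i - \mu_i = 0 \tag{3.7} $$

$$ \frac{\partial L}{\partial \eta_i} = C_2 - \theta_i - \psi_i = 0 \tag{3.8} $$

From 3.5, the vector $\boldsymbol{\beta}$ is given by:

$$ \boldsymbol{\beta} = \sum_{i=1}^{n} (\alpha_i - \theta_i) y_i \mathbf{x}_i \tag{3.9} $$

From 3.6, vectors $\boldsymbol{\alpha}$ and $\boldsymbol{\theta}$ satisfy the equality constraint:

$$ \sum_{i=1}^{n} (\alpha_i - \theta_i) y_i = 0 \tag{3.10} $$

Combining 3.7, 3.8 and 3.4, the elements of $\boldsymbol{\alpha}$ must satisfy:

$$ 0 \leq \alpha_i \leq C_1 \tag{3.11} $$

and elements of $\boldsymbol{\theta}$ satisfy:

$$ 0 \leq \theta_i \leq C_2 \tag{3.12} $$

Let $\mathbf{B}$ be an $n \times n$ matrix with entries:

$$ B_{ij} = y_i y_j \mathbf{x}_i^T \mathbf{x}_j \tag{3.13} $$

and $\mathbf{e}_n$ be an $n \times 1$ vector of $n$ ones (in MATLAB notation: $\mathbf{e}_n$ = ones(n, 1)). Substituting $\boldsymbol{\beta}$ from
3.9 in 3.3 and noting the constraints 3.7, 3.8 and 3.10, we get the B-SVM dual problem:

$$ \begin{aligned}
\max_{\boldsymbol{\alpha}, \boldsymbol{\theta}} \quad & L_D(\boldsymbol{\alpha}, \boldsymbol{\theta}) = \rho_1 \mathbf{e}_n^T \boldsymbol{\alpha} - \rho_2 \mathbf{e}_n^T \boldsymbol{\theta} - \frac{1}{2} (\boldsymbol{\alpha} - \boldsymbol{\theta})^T \mathbf{B} (\boldsymbol{\alpha} - \boldsymbol{\theta}) \\
\text{subject to} \quad & \mathbf{0} \leq \boldsymbol{\alpha} \leq C_1 \mathbf{e}_n \\
& \mathbf{0} \leq \boldsymbol{\theta} \leq C_2 \mathbf{e}_n \\
& (\boldsymbol{\alpha} - \boldsymbol{\theta})^T \mathbf{y} = 0
\end{aligned} \tag{3.14} $$

If $C_2 = 0$ and $\rho_1 = 1$ then 3.12 implies $\boldsymbol{\theta} = \mathbf{0}$ and hence we recover the standard C-SVM dual
problem.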
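As a rough illustration of problem 3.14, the dual can be handed to a general-purpose constrained solver. The sketch below uses SciPy's SLSQP method, which is a stand-in chosen for brevity and not the solver discussed in this paper; the function name `solve_bsvm_dual` and the four-point data set are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def solve_bsvm_dual(K, y, rho1=1.0, rho2=1.5, c1=10.0, c2=100.0):
    """Maximize the dual 3.14 over z = (alpha, theta) with a generic solver."""
    n = len(y)
    B = np.outer(y, y) * K                      # B_ij = y_i y_j K_ij

    def neg_dual(z):                            # minimize -L_D(alpha, theta)
        a, th = z[:n], z[n:]
        d = a - th
        return -(rho1 * a.sum() - rho2 * th.sum() - 0.5 * d @ B @ d)

    def neg_dual_grad(z):                       # gradient of -L_D
        Bd = B @ (z[:n] - z[n:])
        return np.concatenate([Bd - rho1, rho2 - Bd])

    balance = {"type": "eq",                    # (alpha - theta)^T y = 0
               "fun": lambda z: (z[:n] - z[n:]) @ y,
               "jac": lambda z: np.concatenate([y, -y])}
    bounds = [(0.0, c1)] * n + [(0.0, c2)] * n  # box constraints 3.11 and 3.12
    res = minimize(neg_dual, np.zeros(2 * n), jac=neg_dual_grad,
                   bounds=bounds, constraints=[balance], method="SLSQP")
    return res.x[:n], res.x[n:]

# Tiny linearly separable example with a linear kernel (illustrative data only).
X_demo = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [4.0, 5.0]])
y_demo = np.array([-1.0, -1.0, 1.0, 1.0])
alpha_demo, theta_demo = solve_bsvm_dual(X_demo @ X_demo.T, y_demo)
```

A dense general-purpose solver of this kind is only practical for small $n$, since the dual has $2n$ variables.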



3.2 Kernelifying B-SVM 

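Let $\mathbf{h}$ be a non-linear vector function that takes inputs $\mathbf{x}_i$ into a high dimensional space. Then
we recover kernel B-SVM by doing linear B-SVM on the data-label pairs $(\mathbf{h}(\mathbf{x}_i), y_i)$ instead of
the original pairs $(\mathbf{x}_i, y_i)$. In practice, we do not need $\mathbf{h}(\mathbf{x})$ explicitly but only the dot products
through a kernel matrix $\mathbf{K}$ with elements:

$$ K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{h}(\mathbf{x}_i)^T \mathbf{h}(\mathbf{x}_j) \tag{3.15} $$

This is the so-called kernel trick. From 3.13, elements of matrix $\mathbf{B}$ for transformed feature vectors
$\mathbf{h}(\mathbf{x})$ are given by:

$$ B_{ij} = y_i y_j \mathbf{h}(\mathbf{x}_i)^T \mathbf{h}(\mathbf{x}_j) = y_i y_j K_{ij} = y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \tag{3.16} $$

For a new point $\mathbf{x}$, the decision rule is then given by:

$$ g(\mathbf{x}) = \boldsymbol{\beta}^T \mathbf{h}(\mathbf{x}) + \beta_0 \tag{3.17} $$

and $\mathbf{x}$ is classified into class +1 if $g(\mathbf{x}) > 0$ and into class −1 if $g(\mathbf{x}) < 0$. From 3.9, for the
transformed feature vectors $\mathbf{h}(\mathbf{x}_i)$, we have:

$$ \boldsymbol{\beta} = \sum_{i=1}^{n} (\alpha_i - \theta_i) y_i \mathbf{h}(\mathbf{x}_i) \tag{3.18} $$

Using the kernel trick, calculation of $g(\mathbf{x})$ does not need $\mathbf{h}(\mathbf{x})$ explicitly as we can write:

$$ g(\mathbf{x}) = \boldsymbol{\beta}^T \mathbf{h}(\mathbf{x}) + \beta_0 = \sum_{i=1}^{n} (\alpha_i - \theta_i) y_i K(\mathbf{x}_i, \mathbf{x}) + \beta_0 \tag{3.19} $$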
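Equation 3.19 translates directly into code. The sketch below is illustrative only: `decision_function` and `linear_kernel` are hypothetical names, the kernel is passed in as a callable returning the matrix of pairwise kernel values, and the dual variables in the example are made-up numbers rather than a fitted solution.

```python
import numpy as np

def decision_function(X_train, y, alpha, theta, beta0, X_new, kernel):
    """Evaluate g(x) = sum_i (alpha_i - theta_i) y_i K(x_i, x) + beta0 (eq. 3.19)."""
    coef = (alpha - theta) * y                  # one weight per training point
    return kernel(X_new, X_train) @ coef + beta0

def linear_kernel(A, Z):
    """K(a, z) = a^T z for every row pair of A and Z."""
    return A @ Z.T

X_train = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])                    # illustrative values only
theta = np.zeros(2)
g_new = decision_function(X_train, y, alpha, theta, 0.1,
                          np.array([[2.0, 1.0]]), linear_kernel)
```

Only training points with $\alpha_i \neq \theta_i$ contribute to the sum, which is why the sparsity of the rule depends on $(\alpha_i - \theta_i)$.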



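Proposition 3.1. The B-SVM dual objective function $L_D(\boldsymbol{\alpha}, \boldsymbol{\theta})$ in 3.14 is a concave function of
$\boldsymbol{\alpha}$ and $\boldsymbol{\theta}$.

Proof. Since $\mathbf{B}$ is symmetric, the Hessian of $L_D$ with respect to the vector $(\boldsymbol{\alpha}, \boldsymbol{\theta})$ is given by:

$$ \mathbf{H} = \begin{pmatrix} -\mathbf{B} & \mathbf{B} \\ \mathbf{B} & -\mathbf{B} \end{pmatrix} \tag{3.20} $$

If $\mathbf{c}$ and $\mathbf{d}$ are arbitrary $n \times 1$ vectors,

$$ \begin{pmatrix} \mathbf{c}^T & \mathbf{d}^T \end{pmatrix} \mathbf{H} \begin{pmatrix} \mathbf{c} \\ \mathbf{d} \end{pmatrix} = \mathbf{c}^T(-\mathbf{B}\mathbf{c} + \mathbf{B}\mathbf{d}) + \mathbf{d}^T(\mathbf{B}\mathbf{c} - \mathbf{B}\mathbf{d}) = -(\mathbf{c} - \mathbf{d})^T \mathbf{B} (\mathbf{c} - \mathbf{d}) \tag{3.21} $$

From 3.16,

$$ (\mathbf{c} - \mathbf{d})^T \mathbf{B} (\mathbf{c} - \mathbf{d}) = \sum_{i=1}^{n} \sum_{j=1}^{n} (\mathbf{c} - \mathbf{d})_i \{y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)\} (\mathbf{c} - \mathbf{d})_j = \sum_{i=1}^{n} \sum_{j=1}^{n} \{(\mathbf{c} - \mathbf{d})_i y_i\} K(\mathbf{x}_i, \mathbf{x}_j) \{(\mathbf{c} - \mathbf{d})_j y_j\} \tag{3.22} $$

If $\odot$ is an element-wise multiplication operator then:

$$ (\mathbf{c} - \mathbf{d})^T \mathbf{B} (\mathbf{c} - \mathbf{d}) = \{(\mathbf{c} - \mathbf{d}) \odot \mathbf{y}\}^T \mathbf{K} \{(\mathbf{c} - \mathbf{d}) \odot \mathbf{y}\} \geq 0 \tag{3.23} $$

where the last inequality holds since $\mathbf{K}$ is a kernel matrix which is positive semi-definite by 3.15.
Therefore, from 3.21 and 3.23:

$$ \begin{pmatrix} \mathbf{c}^T & \mathbf{d}^T \end{pmatrix} \mathbf{H} \begin{pmatrix} \mathbf{c} \\ \mathbf{d} \end{pmatrix} \leq 0 \tag{3.24} $$

for all vectors $\mathbf{c}$ and $\mathbf{d}$. Thus $L_D(\boldsymbol{\alpha}, \boldsymbol{\theta})$ is a concave function of $(\boldsymbol{\alpha}, \boldsymbol{\theta})$. □

It immediately follows that problem 3.14 attempts to maximize a concave function under linear
constraints and thus every local maximum is a global maximum [Nocedal and Wright, 2006]. Note
that 3.21 vanishes whenever $\mathbf{c} = \mathbf{d}$, so $L_D$ is concave but not strictly concave in $(\boldsymbol{\alpha}, \boldsymbol{\theta})$.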
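The concavity claim can be checked numerically on a small example by confirming that the Hessian in 3.20 has no positive eigenvalues. This is a sanity check under assumed random data with a linear kernel, not a substitute for the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                     # six random 2-D points (assumed data)
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

K = X @ X.T                                     # linear kernel Gram matrix, PSD
B = np.outer(y, y) * K                          # B_ij = y_i y_j K_ij  (eq. 3.16)
H = np.block([[-B, B], [B, -B]])                # Hessian of L_D      (eq. 3.20)

max_eig = np.linalg.eigvalsh(H).max()           # largest eigenvalue of H
```

Up to floating point round-off, `max_eig` is zero, which reflects the flat directions of the dual objective along $\mathbf{c} = \mathbf{d}$.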

3.3 Calculation of dual variables 

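Dual variables $\boldsymbol{\alpha}, \boldsymbol{\theta}, \boldsymbol{\mu}, \boldsymbol{\psi}$ can be calculated as follows:

• Calculation of $\boldsymbol{\alpha}, \boldsymbol{\theta}$ requires the solution of the concave maximization problem 3.14 where
the elements of $\mathbf{B}$ are chosen using a suitable kernel $K(\mathbf{x}_i, \mathbf{x}_j)$. This can be accomplished
using a sequential minimal optimization (SMO) type active set technique [Platt, 1998] or a
projected conjugate gradient (PCG) technique [Nocedal and Wright, 2006].

• Once $\boldsymbol{\alpha}$ and $\boldsymbol{\theta}$ are known, equations 3.7 and 3.8 give $\boldsymbol{\mu} = C_1 \mathbf{e}_n - \boldsymbol{\alpha}$ and $\boldsymbol{\psi} = C_2 \mathbf{e}_n - \boldsymbol{\theta}$.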

3.4 Calculation of primal variables 

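Primal variables $\boldsymbol{\beta}, \beta_0, \boldsymbol{\xi}, \boldsymbol{\eta}$ can be calculated as follows:

• $\boldsymbol{\beta}$ is given by equation 3.18.

• Calculation of $\beta_0, \boldsymbol{\xi}, \boldsymbol{\eta}$ is accomplished by considering the inequality constraints and the
KKT complementarity constraints for the problem 3.2:

$$ \begin{aligned}
\xi_i \geq 0, \quad \eta_i &\geq 0 \\
\xi_i &\geq \rho_1 - y_i(\boldsymbol{\beta}^T \mathbf{h}(\mathbf{x}_i) + \beta_0) \\
\eta_i &\geq -\rho_2 + y_i(\boldsymbol{\beta}^T \mathbf{h}(\mathbf{x}_i) + \beta_0) \\
\alpha_i \{\xi_i - \rho_1 + y_i(\boldsymbol{\beta}^T \mathbf{h}(\mathbf{x}_i) + \beta_0)\} &= 0 \\
\theta_i \{\eta_i + \rho_2 - y_i(\boldsymbol{\beta}^T \mathbf{h}(\mathbf{x}_i) + \beta_0)\} &= 0 \\
\mu_i \xi_i = (C_1 - \alpha_i) \xi_i &= 0 \\
\psi_i \eta_i = (C_2 - \theta_i) \eta_i &= 0
\end{aligned} \tag{3.25} $$

Given the positivity constraints 3.4 and the bound constraints 3.11 and 3.12, we consider the
following cases:

• If $\alpha_i < C_1$ then $\xi_i = 0$ and similarly if $\theta_i < C_2$ then $\eta_i = 0$.

• If $0 < \alpha_i < C_1$ then we have $\xi_i = 0$ and $\{\xi_i - \rho_1 + y_i(\boldsymbol{\beta}^T \mathbf{h}(\mathbf{x}_i) + \beta_0)\} = 0$ which can be used
to solve for $\beta_0$.

• If $0 < \theta_i < C_2$ then we have $\eta_i = 0$ and $\{\eta_i + \rho_2 - y_i(\boldsymbol{\beta}^T \mathbf{h}(\mathbf{x}_i) + \beta_0)\} = 0$ which can be
used to solve for $\beta_0$.

• Similar to C-SVM, for stability purposes we can average the estimate of $\beta_0$ over all points
where $0 < \alpha_i < C_1$ and $0 < \theta_i < C_2$.

• We can calculate $\xi_i$ for those points for which $\alpha_i = C_1$ using $\xi_i = \rho_1 - y_i(\boldsymbol{\beta}^T \mathbf{h}(\mathbf{x}_i) + \beta_0)$.

• Similarly, if $\theta_i = C_2$ then $\eta_i = y_i(\boldsymbol{\beta}^T \mathbf{h}(\mathbf{x}_i) + \beta_0) - \rho_2$.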
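The averaging step for $\beta_0$ can be sketched as follows. Because $y_i^2 = 1$, the condition $y_i(\boldsymbol{\beta}^T \mathbf{h}(\mathbf{x}_i) + \beta_0) = \rho$ rearranges to $\beta_0 = \rho\, y_i - \boldsymbol{\beta}^T \mathbf{h}(\mathbf{x}_i)$. The function name `estimate_beta0`, the tolerance `tol` and the example numbers are assumptions made for illustration.

```python
import numpy as np

def estimate_beta0(K, y, alpha, theta, rho1, rho2, c1, c2, tol=1e-8):
    """Average beta0 over points with 0 < alpha_i < C1 or 0 < theta_i < C2."""
    f = K @ ((alpha - theta) * y)               # beta^T h(x_i) for each training point
    free_alpha = (alpha > tol) & (alpha < c1 - tol)   # y_i (f_i + beta0) = rho1
    free_theta = (theta > tol) & (theta < c2 - tol)   # y_i (f_i + beta0) = rho2
    estimates = np.concatenate([rho1 * y[free_alpha] - f[free_alpha],
                                rho2 * y[free_theta] - f[free_theta]])
    return estimates.mean()

# Illustrative two-point example (values chosen by hand, not a fitted model).
K_small = np.array([[2.0, 0.0], [0.0, 2.0]])
y_small = np.array([1.0, -1.0])
b0 = estimate_beta0(K_small, y_small, np.array([0.5, 0.5]), np.zeros(2),
                    rho1=1.0, rho2=1.5, c1=10.0, c2=100.0)
```

If no point has a dual variable strictly inside its bounds, the set of estimates is empty and this sketch returns `nan`; a practical implementation would need a fallback for that case.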

4 Toy data 

In order to illustrate the differences between C-SVM and B-SVM we generated artificial data in 2
dimensions as follows:

• Class 1 consisted of 5 bivariate Normal clusters, one centered at $(0, 0)$ and four at further
locations (coordinates illegible in the source), each with covariance $\sigma_1^2 \mathbf{I}_2$ with $\sigma_1 = 0.2$.

• Class −1 consisted of 4 bivariate Normal clusters centered at $(1, 0)$, $(0, 1)$, $(-1, 0)$ and $(0, -1)$
with covariance $\sigma_2^2 \mathbf{I}_2$ with $\sigma_2 = 0.2$.

A radial basis function (RBF) kernel was chosen for computations. For the RBF kernel, the elements
of $\mathbf{K}$ are given by:

$$ K(\mathbf{x}_i, \mathbf{x}_j) = K_{ij} = \exp\left\{ -\gamma\, (\mathbf{x}_i - \mathbf{x}_j)^T (\mathbf{x}_i - \mathbf{x}_j) \right\} \tag{4.1} $$
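A minimal sketch of how the RBF kernel matrix in 4.1 might be computed for all training pairs at once. The function name `rbf_kernel_matrix` and the three demonstration points are hypothetical.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """K_ij = exp(-gamma * ||x_i - x_j||^2) for the rows x_i of X (eq. 4.1)."""
    sq_norms = (X ** 2).sum(axis=1)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i^T x_j
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))  # clip tiny negative round-off

K_demo = rbf_kernel_matrix(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```

The resulting matrix is symmetric with ones on the diagonal, and its entries decay with the squared distance between points.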

Our parameter settings were as follows:

• For both C-SVM and B-SVM we used the same kernel parameter $\gamma = 1$.

• For C-SVM we used $C = 10$.

• For B-SVM we chose $\rho_1 = 1$ and $C_1 = 10$ (same as $C$ for C-SVM). Thus the parameters of
the common penalty term $C_1 \sum_{i=1}^{n} [\rho_1 - y_i(\boldsymbol{\beta}^T \mathbf{h}(\mathbf{x}_i) + \beta_0)]_+$ are chosen to be identical for
C-SVM and B-SVM.

• The parameters of the second penalty term for B-SVM were chosen as $C_2 = 100$ and $\rho_2 = 1.5$.
Thus B-SVM will encourage $g(\mathbf{x})$ to lie in the interval $[\rho_1, \rho_2] = [1, 1.5]$.

Figure 2: Classification obtained for example data using (a) C-SVM and (b) B-SVM.
Red and Blue points (.) correspond to class +1 and −1 respectively. Cyan and Orange x-marks
(x) show the C-SVM and B-SVM decision rules evaluated at various points. Class 1 membership
is indicated in Cyan and class −1 membership is indicated in Orange. The squares in
(a) correspond to support points for which $0 < \alpha_i < C$. The cyan squares in (b) correspond to
support points for which $0 < \theta_i < C_2$ and the green squares correspond to support points for which
$0 < \alpha_i < C_1$. The sparsity of solution is controlled by $\boldsymbol{\alpha}$ in the case of C-SVM and $(\boldsymbol{\alpha} - \boldsymbol{\theta})$ in the
case of B-SVM. (c) Shows $\alpha_i$ values for C-SVM. (d) Shows $(\alpha_i - \theta_i)$ values for B-SVM.

Figure 3: Decision rule $g(\mathbf{x})$ for C-SVM (a) and B-SVM (b). Note that in B-SVM
the second penalty term $C_2 \sum_{i=1}^{n} [y_i(\boldsymbol{\beta}^T \mathbf{h}(\mathbf{x}_i) + \beta_0) - \rho_2]_+$ results in most of the $g(\mathbf{x})$ values lying in the
interval $[\rho_1, \rho_2] = [1, 1.5]$. (c) Heat map of the decision rule $g(\mathbf{x})$ for C-SVM. (d) Heat map of the
decision rule $g(\mathbf{x})$ for B-SVM. In C-SVM the values of decision rule $g(\mathbf{x})$ are unbalanced in Class
1. The central cluster located at (0, 0) in Class 1 gets much smaller $g(\mathbf{x})$ values in C-SVM than
the rest of Class 1. In B-SVM however, all clusters in Class 1 including the one centered at
(0, 0) get similar $g(\mathbf{x})$ values. This is a result of the second penalty term in the B-SVM objective
function.

Figure 4: Fraction of points classified correctly by both C-SVM (blue curve) and
B-SVM (red curve) as a function of the decision rule threshold. The x-axis shows the decision rule
threshold as a percentage of the maximum absolute value of the decision function $g(\mathbf{x})$ over all
training points. The y-axis shows the overall classification accuracy or sensitivity of C-SVM and
B-SVM.

Both C-SVM and B-SVM were fitted to the toy data described above. The following differences in
the two solutions are noteworthy:

4.1 α-SVs and θ-SVs

The B-SVM dual problem 3.14 contains two variables $\boldsymbol{\alpha}$ and $\boldsymbol{\theta}$. Both $\alpha_i$ and $\theta_i$ are non-negative and
satisfy the bound constraints given in 3.14. Therefore, similar to C-SVM, we define 2 types of
support vectors (SVs) in B-SVM:

• Points $i$ for which $\theta_i > 0$ are called the θ-SVs (new SVs that arise in B-SVM)

• Points $i$ for which $\alpha_i > 0$ are called the α-SVs (standard C-SVM like SVs)

Figures 2(a) and 2(b) show the C-SVM and B-SVM induced classification respectively for this
example problem. Figure 2(b) shows α-SVs for which $0 < \alpha_i < C_1$ and θ-SVs for which $0 < \theta_i < C_2$.
It is clear from 3.19 that the sparsity of a B-SVM decision rule depends on the quantities $(\alpha_i - \theta_i)$.
Figures 2(c) and 2(d) show a plot of $\alpha_i$ for C-SVM and $(\alpha_i - \theta_i)$ for B-SVM respectively.

4.2 Bounded decision rule 

Figures 3(a) and 3(b) show the decision rule values $g(\mathbf{x})$ over all training points for C-SVM and
B-SVM. Recall that C-SVM does not enforce an upper limit on $g(\mathbf{x})$ whereas B-SVM attempts to
encourage $g(\mathbf{x})$ to lie in $[\rho_1, \rho_2]$. It can be seen in Figure 3(b) that B-SVM was successful in limiting
the absolute value of $g(\mathbf{x})$ to be $\leq \rho_2 = 1.5$ with $C_2 = 100$. Figures 3(c) and 3(d) show a heat map
of the decision rule for C-SVM and B-SVM respectively evaluated over a 2-D grid containing the
training points. It can be seen that:

• The C-SVM decision rule values are unbalanced in class +1 as the central cluster in class +1
gets lower $g(\mathbf{x})$ values compared to other clusters in class +1.

• The decision rule values are balanced in class +1 for B-SVM.

4.3 Sensitivity curve 

We calculate the quantity:

$$ S(t) = \frac{1}{n} \sum_{i=1}^{n} I[y_i\, g(\mathbf{x}_i) > t] \tag{4.2} $$

which is simply the fraction of correctly classified points (or sensitivity) using decision rule $g(\mathbf{x})$ at
threshold $t$. To illustrate the variation in sensitivity of C-SVM and B-SVM decision rules:

• For both C-SVM and B-SVM, we divide the range of $g(\mathbf{x})$ into 50 equally spaced points as
follows (in MATLAB notation):

$$ \mathbf{t} = \text{linspace}(0, \max_{\mathbf{x}} |g(\mathbf{x})|, 50) \tag{4.3} $$

• Then we plot $100 \times \left( t_j / \max_{\mathbf{x}} |g(\mathbf{x})| \right)$ versus $S(t_j)$.

Figure 4 shows this sensitivity curve. It can be seen that for the same percentage threshold on the
decision rule range:

• B-SVM has higher classification accuracy (or is more sensitive) than C-SVM.

• This effect is because of the balanced nature of decision rule values in B-SVM compared to
C-SVM (see Figure 3(c) and 3(d)).
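The sensitivity curve of 4.2 and 4.3 can be sketched in a few lines. The function name `sensitivity_curve` and the four decision values in the example are made up for illustration; they are not the values behind Figure 4.

```python
import numpy as np

def sensitivity_curve(g_values, y, num_thresholds=50):
    """Return (threshold as % of max |g|, S(t)) following eqs. 4.2 and 4.3."""
    g_max = np.abs(g_values).max()
    t = np.linspace(0.0, g_max, num_thresholds)          # eq. 4.3
    signed = y * g_values                                # y_i g(x_i)
    s = (signed[None, :] > t[:, None]).mean(axis=1)      # eq. 4.2 for every t_j
    return 100.0 * t / g_max, s

g_demo = np.array([2.0, -1.0, 0.5, -0.2])                # illustrative decision values
y_demo = np.array([1.0, -1.0, 1.0, 1.0])
percent, sens = sensitivity_curve(g_demo, y_demo)
```

A decision rule whose signed values are concentrated in a narrow band keeps $S(t)$ high over a wider range of thresholds before it drops, which is the behaviour the balanced B-SVM rule is designed to produce.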



5 Discussion and conclusions 

In this work, we considered the binary classification problem when the feature vectors in individual
classes have finite covariance. We showed that B-SVM is a natural generalization of C-SVM in
this situation. The B-SVM dual maximization problem 3.14 retains the concavity
property of its C-SVM counterpart, and C-SVM is a special case of B-SVM when
$C_2 = 0$ and $\rho_1 = 1$. Two types of SVs arise in B-SVM: the α-SVs, which are similar to the standard SVs in
C-SVM, and the θ-SVs, which arise due to the novel penalty in the B-SVM objective function 2.3. The B-SVM
decision rule is more balanced than the C-SVM decision rule since it assigns $g(\mathbf{x})$ values that are
comparable in magnitude to different sub-classes (or clusters) of class +1 and class −1. In addition,
B-SVM retains higher classification accuracy compared to C-SVM as the decision rule threshold is
varied from 0 to $\max_{\mathbf{x}} |g(\mathbf{x})|$. For a training set of size $n$, B-SVM results in a dual optimization
problem of size $2n$ compared to a C-SVM dual problem of size $n$. Hence it is computationally more
expensive to solve a B-SVM problem.

In summary, B-SVM can be used to enforce balanced decision rules in binary classification. It is 
anticipated that the C-SVM leave one out error bounds for the bias free case given in Jaakkola and 
Haussler [1999] will continue to hold in a similar form for bias free B-SVM as well. 